Patch-Level Consistency Regularization in Self-Supervised Transfer Learning for Fine-Grained Image Recognition

Authors

Abstract

Fine-grained image recognition aims to classify fine subcategories belonging to the same parent category, such as vehicle model or bird species classification. This is an inherently challenging task because a classifier must capture subtle interclass differences under large intraclass variances. Most previous approaches are based on supervised learning, which requires a large-scale labeled dataset. However, annotated datasets for fine-grained recognition are difficult to collect because they generally require domain expertise during the labeling process. In this study, we propose a self-supervised transfer learning method based on the Vision Transformer (ViT) to learn finer representations without human annotations. Interestingly, it is observed that existing self-supervised methods using ViT (e.g., DINO) show poor patch-level semantic consistency, which may be detrimental to fine-grained representations. Motivated by this observation, we propose a consistency loss function that encourages the patch embeddings of the overlapping area between two augmented views to be similar to each other during training on fine-grained datasets. In addition, we explore effective transfer strategies to fully leverage pre-trained models. Contrary to the literature, our findings indicate that training only the last block of the ViT is effective for transfer learning. We demonstrate the effectiveness of the proposed approach through extensive experiments on six fine-grained classification benchmark datasets, including FGVC Aircraft, CUB-200-2011, Food-101, Oxford 102 Flowers, Stanford Cars, and Stanford Dogs. Under the linear evaluation protocol, the proposed method achieves an average accuracy of 78.5%, outperforming the existing method, which yields 77.2%.
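The abstract describes the patch-level consistency loss only at a high level. The following PyTorch sketch is an illustration under assumptions, not the paper's implementation: it assumes the patch tokens of the two augmented views are available as (B, N, D) tensors and that the indices of the patches covering the shared image region (derived from the crop geometry) are already known. The function name patch_consistency_loss and the index arguments idx_a and idx_b are hypothetical.

# Minimal sketch of a patch-level consistency loss (assumed, not the paper's code).
import torch
import torch.nn.functional as F

def patch_consistency_loss(patches_a: torch.Tensor,
                           patches_b: torch.Tensor,
                           idx_a: torch.Tensor,
                           idx_b: torch.Tensor) -> torch.Tensor:
    """Encourage patch embeddings of the overlapping area to agree.

    patches_a, patches_b: (B, N, D) patch tokens from a ViT for two augmented views.
    idx_a, idx_b:         (B, K) indices of the K patches in each view that cover
                          the same image region (derived from the crop geometry).
    """
    B, _, D = patches_a.shape
    # Gather the K overlapping patch embeddings from each view.
    pa = torch.gather(patches_a, 1, idx_a.unsqueeze(-1).expand(-1, -1, D))
    pb = torch.gather(patches_b, 1, idx_b.unsqueeze(-1).expand(-1, -1, D))
    # Negative cosine similarity between corresponding patches (lower is better).
    return -F.cosine_similarity(pa, pb, dim=-1).mean()

# Toy usage with random tensors standing in for ViT patch tokens.
patches_a = torch.randn(2, 196, 384)   # e.g., 14x14 patches, 384-dim tokens
patches_b = torch.randn(2, 196, 384)
idx = torch.randint(0, 196, (2, 50))   # 50 "overlapping" patch indices per image
print(patch_consistency_loss(patches_a, patches_b, idx, idx).item())

Negative cosine similarity is one common choice for such agreement terms; the exact form of the loss and how patch correspondences are computed in the paper may differ.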


Similar Articles

Weakly-supervised Discriminative Patch Learning via CNN for Fine-grained Recognition

Research on fine-grained recognition has recently shifted from multistage frameworks to convolutional neural networks (CNN) that are trained end-to-end. Many previous end-to-end deep approaches typically consist of a recognition network and an auxiliary localization network trained with additional part annotations to detect semantic parts shared across classes. To avoid the cost of extra semant...

PatchIt: Self-Supervised Network Weight Initialization for Fine-grained Recognition

ConvNet training is highly sensitive to the initialization of the weights. A widespread approach is to initialize the network with weights trained for a different task, an auxiliary task. The ImageNet-based ILSVRC classification task is a very popular choice for this, as it has been shown to produce powerful feature representations applicable to a wide variety of tasks. However, this creates a significa...

Weakly Supervised Fine-Grained Image Categorization

In this paper, we categorize fine-grained images without using any object / part annotation in either the training or the testing stage, a step towards making it suitable for deployments. Fine-grained image categorization aims to classify objects with subtle distinctions. Most existing works heavily rely on object / part detectors to build the correspondence between object parts by using o...

Exemplar-Specific Patch Features for Fine-Grained Recognition

In this paper, we present a new approach for fine-grained recognition or subordinate categorization, tasks where an algorithm needs to reliably differentiate between visually similar categories, e.g., different bird species. While previous approaches aim at learning a single generic representation and models with increasing complexity, we propose an orthogonal approach that learns patch represe...


Journal

Journal title: Applied Sciences

Year: 2023

ISSN: 2076-3417

DOI: https://doi.org/10.3390/app131810493